First load required packages and set some global parameters.
Code
library(tidyverse)
library(brms)
library(tidyboot)
library(tidyjson)
library(patchwork)
library(GGally)
library(cowplot)
library(BayesFactor)
library(aida)       # custom helpers: https://github.com/michael-franke/aida-package
library(faintr)     # custom helpers: https://michael-franke.github.io/faintr/index.html
library(cspplot)    # custom styles: https://github.com/CogSciPrag/cspplot
library(ordbetareg) # for ordered beta-regression
library(cmdstanr)

##################################################

# these options help Stan run faster
options(mc.cores = parallel::detectCores(),
        brms.backend = "cmdstanr")

# use the CSP-theme for plotting
theme_set(theme_csp())

# global color scheme from CSP
project_colors = cspplot::list_colors() |> pull(hex)

# setting theme colors globally
scale_colour_discrete <- function(...) {
  scale_colour_manual(..., values = project_colors)
}
scale_fill_discrete <- function(...) {
  scale_fill_manual(..., values = project_colors)
}

##################################################

# if FALSE, previously fitted models are reused instead of refitted
rerun_models <- FALSE

# p-value-like score computed from row 2 of a loo_compare() table
# (column 1: elpd_diff, column 2: se_diff)
Lambert_test <- function(loo_comp) {
  1 - pnorm(-loo_comp[2,1], loo_comp[2,2])
}

##################################################

rl_file <- "R_data_4_TeX/myvars-reproduce.csv"
myvars = list() # add with: myvars[name] = value
Read & massage the data
Data preprocessing was done in Python. Here we load the preprocessed data and rearrange it for convenience.
Code
d <- read_csv("../results/round_2.0/results_preprocessed.csv") |>
  # drop column with numbers
  select(-`...1`) |>
  # set "non-answers" to AnswerPolarity "positive"
  mutate(AnswerPolarity = ifelse(AnswerCertainty == "non_answer", "positive", AnswerPolarity)) |>
  # casting into factor
  mutate(
    group = factor(group, levels = c("relevant", "helpful")),
    ContextType = factor(ContextType, levels = c("negative", "neutral", "positive")),
    AnswerPolarity = factor(AnswerPolarity, levels = c("positive", "negative")),
    AnswerCertainty = factor(AnswerCertainty,
                             levels = c("non_answer", "low_certainty", "high_certainty", "exhaustive"))
  )
We explored several versions of various relevance metrics (as reported in the paper + appendix). Here we select only those versions of the relevance metrics used for the main analyses as reported in the paper:
Code
d <- d |>
  # columns to keep
  select(
    "submission_id",
    "group",
    "StimID",
    "AnswerCertainty",
    "AnswerPolarity",
    "ContextType",
    "attention_score",
    "reasoning_score",
    "prior_sliderResponse",
    "posterior_sliderResponse",
    "posterior_confidence",
    "prior_confidence",
    "relevance_sliderResponse",
    "first_order_belief_change",
    "second_order_belief_change",
    "pure_second_order_belief_change_scaled",
    "entropy_change_scaled",
    "beta_entropy_change_scaled",
    "kl_scaled",
    "beta_kl_scaled",
    "bayes_factor_utility",
    "beta_bayes_factor_utility_1_scaled"
  ) |>
  # new name = old name
  rename(
    "ProbabilityChange"      = "first_order_belief_change",
    "CommitmentChange"       = "second_order_belief_change",
    "ConcentrationChange"    = "pure_second_order_belief_change_scaled",
    "EntropyChange"          = "entropy_change_scaled",
    "EntropyChange_2ndOrder" = "beta_entropy_change_scaled",
    "KLUtility"              = "kl_scaled",
    "KLUtility_2ndOrder"     = "beta_kl_scaled",
    "BayesFactor"            = "bayes_factor_utility",
    "BayesFactor_2ndOrder"   = "beta_bayes_factor_utility_1_scaled"
  )
Data exclusion
As per preregistered protocol, we exclude all data from participants who:
did not pass every attention check (i.e., have an attention score below 1),
scored less than 0.5 on reasoning tasks, or
have a task sensitivity of not more than 0.75.
Task sensitivity is the proportion of critical trials (excluding non-answer trials) in which the change between prior and posterior rating was bigger than 0.05 or there was a non-zero change in confidence rating.
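As an illustration of the definition above, here is a minimal base-R sketch of how task sensitivity could be computed per participant. The toy data frame and its values are hypothetical; only the column names follow the data set used here. The actual preprocessing was done in Python and may differ in detail.

```r
# Toy data (hypothetical values; column names follow the data set)
toy <- data.frame(
  submission_id            = c(1, 1, 1, 1, 2, 2, 2, 2),
  AnswerCertainty          = rep(c("non_answer", "low_certainty",
                                   "high_certainty", "exhaustive"), times = 2),
  prior_sliderResponse     = c(0.50, 0.50, 0.40, 0.30, 0.50, 0.50, 0.50, 0.50),
  posterior_sliderResponse = c(0.50, 0.70, 0.41, 0.90, 0.90, 0.52, 0.50, 0.95),
  prior_confidence         = c(4, 4, 3, 5, 4, 4, 4, 4),
  posterior_confidence     = c(4, 4, 5, 7, 4, 4, 4, 6)
)

# keep critical trials only (non-answer trials are excluded)
critical <- toy[toy$AnswerCertainty != "non_answer", ]

# a trial counts as "sensitive" if the rating moved by more than 0.05
# or if the confidence rating changed at all
critical$sensitive <-
  abs(critical$posterior_sliderResponse - critical$prior_sliderResponse) > 0.05 |
  critical$posterior_confidence != critical$prior_confidence

# proportion of sensitive critical trials per participant
task_sensitivity <- tapply(critical$sensitive, critical$submission_id, mean)
task_sensitivity
```

In this toy example, participant 1 reacts on every critical trial, whereas participant 2 reacts on only one of three and would be excluded by the 0.75 criterion.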
# initial number of participants
initial_nr_participants <- d |> pull(submission_id) |> unique() |> length()

d <- d |>
  filter(attention_score == 1) |>
  filter(reasoning_score > 0.5) |>
  filter(task_sensitivity > 0.75)

# included participants
included_nr_participants <- d |> pull(submission_id) |> unique() |> length()

message("Initial number of participants: ", initial_nr_participants,
        "\nIncluded after cleaning: ", included_nr_participants,
        "\nExcluded: ", initial_nr_participants - included_nr_participants)
Exploring the effect of the experimental factors
The experiment had the following main factors:
ContextType: whether the context made a ‘no’ or a ‘yes’ answer more likely a priori or whether it was neutral (within-subjects)
AnswerCertainty: how much information the answer provides towards a fully resolving answer (within-subjects)
AnswerPolarity: whether the answer suggests or implies a ‘no’ or a ‘yes’ answer (within-subjects)
‘non-answers’ are treated as ‘positive’ for technical purposes, but this does not influence relevant analyses
In the following, we first check whether these experimental manipulations worked as intended.
Sanity-checking whether the manipulations worked as intended
Effects of ContextType on prior and prior confidence
To check whether the ContextType manipulation worked, we compare how participants rated the prior probability of a ‘yes’ answer under each level of the ContextType variable. Concretely, we expect this order of prior ratings for the levels of ContextType: negative < neutral < positive. Although we have no specific hypotheses or sanity-checking questions regarding the confidence ratings, let’s also scrutinize the confidence ratings that participants gave with their prior ratings.
Prior ratings as a function of ContextType
Here is a first plot addressing the question of whether ContextType affects participants’ prior ratings.
Code
d |>
  ggplot(aes(x = prior_sliderResponse, color = ContextType, fill = ContextType)) +
  geom_density(alpha = 0.3, linewidth = 1.5) +
  xlab("prior rating") +
  ylab("")
We dive deeper by fitting a regression model, predicting prior ratings in terms of the ContextType. Since participants have not seen the answer when they rate the prior probability of a ‘yes’ answer, ContextType is the only fixed effect we should include here. The model also includes the maximal RE structure. We use the ordbetareg package for (slider-data appropriate) zero-one inflated ordinal beta regression.
Our assumption is that prior ratings are higher in contexts classified as ‘neutral’ than in ‘negative’ contexts, and yet higher in ‘positive’ contexts. We use the faintr package to extract information on these directed comparisons.
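To make the notion of a directed comparison concrete, the following base-R sketch computes the posterior probability that one group's value is below another's, together with a highest-density interval (HDI) for the difference. The draws are simulated and purely hypothetical, and `hdi_interval` is an illustrative helper, not a function from faintr or brms; it only shows the kind of summary such a comparison reports.

```r
set.seed(1)

# Hypothetical posterior draws of the linear predictor for two levels of
# ContextType (in a real analysis these would come from the fitted model)
draws_negative <- rnorm(4000, mean = -0.5, sd = 0.2)
draws_neutral  <- rnorm(4000, mean =  0.2, sd = 0.2)

# draws of the difference "neutral minus negative"
diff_draws <- draws_neutral - draws_negative

# posterior probability of the directed hypothesis "negative < neutral"
post_prob <- mean(diff_draws > 0)

# shortest interval containing a given share of the draws
hdi_interval <- function(x, mass = 0.95) {
  x <- sort(x)
  n <- length(x)
  k <- ceiling(mass * n)                 # number of draws inside the interval
  starts <- 1:(n - k + 1)                # candidate left end points
  widths <- x[starts + k - 1] - x[starts]
  i <- which.min(widths)                 # narrowest candidate wins
  c(lower = x[i], upper = x[i + k - 1])
}

interval <- hdi_interval(diff_draws)
post_prob
interval
```

A posterior probability close to 1 together with an HDI that excludes 0 would count as support for the directed hypothesis.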
Here is a visualization of the effect of ContextType on participants’ confidence in their prior ratings.
Code
d |>
  mutate(prior_confidence = factor(prior_confidence, levels = 1:7)) |>
  ggplot(aes(x = prior_confidence, color = ContextType, fill = ContextType)) +
  geom_histogram(position = "dodge", stat = 'count') +
  xlab("prior confidence") +
  ylab("")
To scrutinize the effect of ContextType on participants’ expressed confidence in their prior ratings, we use an ordered-logit (cumulative logit) regression (since prior confidence ratings are from a rating scale).
The results of these comparisons are summarized here:
| comparison          | measure          | posterior | HDI-low   | HDI-high |
|---------------------|------------------|-----------|-----------|----------|
| negative < neutral  | prior            | 0.9986667 | 0.278882  | 1.149290 |
| neutral < positive  | prior            | 1.0000000 | 0.528150  | 1.551521 |
| neutral < negative  | prior-confidence | 0.9900000 | 0.176561  | 1.289040 |
| negative < positive | prior-confidence | 0.8025000 | -0.485097 | 1.177040 |
| neutral < positive  | prior-confidence | 0.9965000 | 0.355817  | 1.716866 |
The ContextType manipulation seems to have worked as expected for the prior ratings: they are lower in ‘negative’ than in ‘neutral’ contexts, and lower in ‘neutral’ than in ‘positive’ contexts. As for the confidence ratings, the ContextType manipulation seems to have induced lower confidence in the ‘neutral’ condition than in the ‘negative’ and ‘positive’ conditions.
Effects of AnswerPolarity and AnswerCertainty on beliefChange
We can define beliefChange as the difference between posterior and prior in the direction expected from the answer’s polarity (posterior belief in ‘yes’ answer increases for a ‘positive’ answer when compared with the prior rating, but it decreases for ‘negative’ answers). (Careful: we ignore non-answers (which are categorized as “positive” for technical convenience only).) If our manipulation worked, we expect the following for both ‘positive’ and ‘negative’ polarity:
beliefChange is > 0
beliefChange is lower for ‘low certainty’ than for ‘high certainty’ than for ‘exhaustive’
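The definition of beliefChange given above can be sketched in a few lines of base R. The toy data are hypothetical; the column names follow the data set, and non-answers are dropped as described.

```r
# Toy data (hypothetical values; column names follow the data set)
toy <- data.frame(
  AnswerPolarity           = c("positive", "negative", "positive", "positive"),
  AnswerCertainty          = c("low_certainty", "low_certainty", "exhaustive", "non_answer"),
  prior_sliderResponse     = c(0.40, 0.60, 0.50, 0.50),
  posterior_sliderResponse = c(0.55, 0.45, 0.95, 0.50)
)

# non-answers are ignored (their "positive" polarity is a technical convention only)
answers <- toy[toy$AnswerCertainty != "non_answer", ]

# expected direction of change: up for 'positive', down for 'negative' answers
direction <- ifelse(answers$AnswerPolarity == "positive", 1, -1)

# signed change in the direction expected from the answer's polarity
answers$beliefChange <-
  direction * (answers$posterior_sliderResponse - answers$prior_sliderResponse)

answers[, c("AnswerPolarity", "AnswerCertainty", "beliefChange")]
```

With this sign convention, a positive beliefChange always means that the rating moved in the direction the answer suggests, regardless of polarity.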
To address the first issue, whether beliefChange is positive for both types of polarity, we first regress beliefChange against the full list of potentially relevant factors, including plausible RE structures. Notice that at the time of answering the questions related to the posterior, participants have not yet seen the question about relevance or helpfulness, so the factor group should be omitted.
These results suggest that there is little reason to doubt that the belief changes induced by the answers went in the right direction in all cases, as intended by the experimental manipulation.
beliefChange increases with more informative answers
Finally, we investigate whether beliefChange increases with more informative answers, using the same regression model as before.
We see no indication of a main effect of polarity, but find support for the idea that our manipulation of AnswerCertainty induced gradually larger belief changes. In sum, it seems that the stimuli were adequately created to implement the intended manipulation in the variables AnswerCertainty and AnswerPolarity.
Predicting relevance in terms of the experimental factors
We want to explore how relevance ratings depend on the experimental manipulations. We therefore investigate the effects that the variables AnswerCertainty, AnswerPolarity, and ContextType have on relevance ratings, starting with a visualization:
The table shows that there are (unsurprising) effects of AnswerCertainty: non-answers are rated as least relevant, followed by low-certainty answers, then high-certainty answers, and finally exhaustive answers. There is no (strong) indication of a main effect of AnswerPolarity or ContextType. The lack of an effect of ContextType might be interpreted as prima facie evidence in favor of quantitative notions of relevance that do not take the prior into account (at least not very prominently).
Here is a plot of the relevant posterior draws, which visually supports why we compared the three factor levels of ContextType in the way we did: ‘positive’ is the lowest and ‘neutral’ the highest, but even this difference is not strongly indicative of an effect (the HDI includes 0):
Research hypotheses 1 and 2 are basic predictions in terms of simple measures of first- and second-order belief change. Research hypothesis 3 is about different notions of quantifying informational relevance.
Hypothesis 1 is that larger belief changes (induced by the answer) lead to higher relevance ratings. We test this hypothesis with a linear beta regression model (with maximal random effects) that regresses relevance ratings against the absolute difference between prior and posterior ratings (ProbabilityChange). We judge there to be evidence in favor of this hypothesis if the relevant slope coefficient is estimated to be credibly bigger than zero (posterior probability > 0.944; an arbitrary value to indicate that there is nothing special about 0.95) and a loo-based model comparison with an intercept-only model substantially favors the model that includes the relevant slope.
Hypothesis 2: confidence change additionally contributes to relevance rating
We also hypothesize that change in confidence (CommitmentChange) ratings additionally contributes to predicting relevance ratings. Concretely, we address this hypothesis with a linear beta regression model like for hypothesis 1, but also including the absolute difference in confidence ratings for before and after the answer (and the interaction term). We use the maximal RE-structure. We speak of evidence in favor of this hypothesis if the relevant posterior slope parameter is credibly bigger than zero and a loo-based model comparison favors the more complex model. We speak of evidence against this hypothesis if the loo-based model comparison favors the simpler model.
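The decision rule just described can be written down as a small function. This is only a sketch: `classify_evidence` and its arguments are hypothetical names, the 0.944 threshold is carried over from hypothesis 1 on the assumption that "credibly bigger than zero" means the same here, and the outcome of the loo-based comparison is passed in as a label rather than computed.

```r
# post_prob_positive: posterior probability that the relevant slope is > 0
# loo_favors: which model the loo-based comparison favors
#             ("complex", "simple", or "neither")
classify_evidence <- function(post_prob_positive, loo_favors, threshold = 0.944) {
  stopifnot(loo_favors %in% c("complex", "simple", "neither"))
  # the simpler model being favored counts as evidence against the hypothesis
  if (loo_favors == "simple") {
    return("evidence against")
  }
  # evidence in favor requires both criteria to hold
  if (loo_favors == "complex" && post_prob_positive > threshold) {
    return("evidence in favor")
  }
  # everything else is left undecided
  "inconclusive"
}

classify_evidence(0.99, "complex")  # "evidence in favor"
classify_evidence(0.99, "simple")   # "evidence against"
classify_evidence(0.80, "complex")  # "inconclusive"
```

Note that the two criteria are asymmetric: a credible slope alone is not enough for a positive verdict, while a loo comparison favoring the simpler model is enough for a negative one.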
Hypothesis 3: “Bayes Factor utility” is the best single-factor predictor of relevance ratings
The third hypothesis is that the BayesFactor is a better (single-factor, linear) predictor of relevance_sliderResponse than KLUtility and EntropyChange. We address this hypothesis with LOO cross-validation. We also directly include the exploratory hypothesis 1 here, thus comparing all single-factor models.
# Comparing the best model to the second best model
myvars["hyp3BFvsBFBetaLOODiff"] <- -1 * (loo_comp_hyp3[2,1] |> round(3))
myvars["hyp3BFvsBFBetaLOOSE"] <- loo_comp_hyp3[2,2] |> round(3)
myvars["hyp3BFvsBFBetaPValue"] <- Lambert_test(loo_comp_hyp3) |> round(3)
Yes, there is a noteworthy difference.
We can conclude that first-order BF is the single best predictor of the relevance ratings.
Addressing the exploratory hypotheses
Exploratory Hypothesis 2: adding ConcentrationChange to all first-order measures
To complement our confirmatory hypothesis 2, we also explore whether adding another measure of higher-order uncertainty change adds predictive performance to each first-order measure of belief change. So here we compare, for each measure \(X\) (“entropy change”, “KL”, and “Bayes factor”) for first-order belief change, whether adding the factor ConcentrationChange increases the predictive performance. Concretely, we compare a model with single factor \(X\) as a predictor to a model with predictors \(X\), ConcentrationChange and their interaction. For ease of fitting, no random effects are included.
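The pairs of models described above can be specified compactly by building the formulas programmatically. The sketch below uses base R's `reformulate()`; it assumes that the response is `relevance_sliderResponse` and that the three first-order measures correspond to the columns `EntropyChange`, `KLUtility` and `BayesFactor`. The helper `make_formula_pair` is a hypothetical name, and the formulas would still have to be passed to a model-fitting function.

```r
first_order <- c("EntropyChange", "KLUtility", "BayesFactor")

# for a first-order measure x, return the single-factor formula and the
# formula with x, ConcentrationChange and their interaction
make_formula_pair <- function(x) {
  list(
    simple  = reformulate(x, response = "relevance_sliderResponse"),
    complex = reformulate(paste(x, "* ConcentrationChange"),
                          response = "relevance_sliderResponse")
  )
}

formula_pairs <- lapply(setNames(first_order, first_order), make_formula_pair)

formula_pairs$KLUtility$simple
formula_pairs$KLUtility$complex
```

Each pair can then be fitted and compared with a loo-based model comparison, one first-order measure at a time.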
Exploratory Hypothesis 3: compare all combinations of first- and second-order measures
Finally, we just compare models with all combinations of first- and second-order measures. Questions of interest are:
Which arbitrary combination of first- and second-order measures is the best?
Does it matter to be consistent in the choice of first- and second-order measures, i.e., is the performance of “first-order X” always boosted most when we supply it with “second-order X” instead of some other “second-order Y”?
Let’s run the models first. For ease of fitting, no random effects are included.
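One way to enumerate the model space is a grid over first- and second-order measures. The sketch below rests on assumptions: it pairs the three first-order columns with their `_2ndOrder` counterparts only, and it joins the two predictors with an interaction, as in exploratory hypothesis 2. The set of measures and the model structure actually used may differ.

```r
first_order  <- c("EntropyChange", "KLUtility", "BayesFactor")
second_order <- paste0(first_order, "_2ndOrder")

# all combinations of one first-order and one second-order measure
combos <- expand.grid(first = first_order, second = second_order,
                      stringsAsFactors = FALSE)

# a combination is "consistent" if both measures derive from the same notion
combos$consistent <- combos$first == sub("_2ndOrder$", "", combos$second)

# model formula (as a string) for each combination; no random effects
combos$formula <- paste("relevance_sliderResponse ~", combos$first, "*", combos$second)

combos
```

The `consistent` column makes it easy to check, after fitting, whether the best partner for each first-order measure is its own second-order counterpart.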
It seems that the overall best model in this comparison is the one that uses Bayes-factor based measures consistently. The second-order Bayes-factor based measure also seems to be the best one to add to the other first-order measures. This also means that it is not the case that “being consistent” in the choice of first- and second-order measure is always best.
Saving variables for LaTeX
Visualizing all results:
Code
library(viridis)
Code
d |>
  ggplot(aes(x = prior_sliderResponse, y = posterior_sliderResponse, color = relevance_sliderResponse)) +
  geom_point(size = 3, alpha = 0.5) +
  scale_color_viridis()
And let’s zoom in on some interesting spots:
Code
d |>
  ggplot(aes(x = prior_sliderResponse, y = posterior_sliderResponse, color = relevance_sliderResponse)) +
  geom_jitter(width = 0.01, height = 0.01, size = 3, alpha = 0.7) +
  scale_color_viridis() +
  coord_cartesian(xlim = c(0, 0.25), ylim = c(0, 0.25))
Note: redo these plots but show delta of relevance_sliderResponse and BF etc.
And we do the same thing with the confidence ratings.
TODO: Sanity-check whether conditions with a highly biased prior and a contradictory answer lead to a decrease in confidence.
Code
d |>
  ggplot(aes(x = prior_confidence, y = posterior_confidence, color = relevance_sliderResponse)) +
  geom_jitter(width = 0.4, height = 0.4, size = 3, alpha = 0.4) +
  scale_color_viridis()
And now we can plot delta confidence and delta probability as x and y.
TODO: Try removing non-answers from these (and other) plots.
Code
d |>
  ggplot(aes(x = ProbabilityChange, y = CommitmentChange, color = relevance_sliderResponse)) +
  geom_jitter(width = 0.0, height = 0.3, size = 3, alpha = 0.4) +
  scale_color_viridis()
Code
d |>
  ggplot(aes(x = ProbabilityChange, y = CommitmentChange, color = relevance_sliderResponse)) +
  geom_jitter(width = 0.01, height = 0.3, size = 3, alpha = 0.4) +
  scale_color_viridis() +
  coord_cartesian(xlim = c(-0.01, 0.15), ylim = c(-0.3, 4))